Skip to content

Conversation

@ChughShilpa
Copy link
Contributor

@ChughShilpa ChughShilpa commented Nov 14, 2025

Add test coverage for JobSet Workflow Orchestration

Description

How Has This Been Tested?

go test ./tests/trainer/ -run TestJobSetWorkflow -v

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Tests

    • Added utilities to locate Jobs and JobSets in test clusters.
    • Added end-to-end JobSet workflow tests covering success and failure scenarios.
    • Added helpers to create, configure and tear down training resources (TrainingRuntime, TrainJob, PVC) and to exercise initializer behavior, including a failing-initializer case.
  • Chores

    • Integrated JobSet API support into the test client to enable JobSet operations in tests.

@openshift-ci
Copy link

openshift-ci bot commented Nov 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chipspeak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Nov 14, 2025

Walkthrough

Adds test helpers to find Jobs and retrieve a single JobSet, extends the test client with a JobSet client accessor, and adds two end-to-end tests validating successful and failing JobSet initializer workflows using TrainingRuntime, TrainJob, and a shared PVC.

Changes

Cohort / File(s) Change Summary
Test Support: Job helpers
tests/common/support/jobs.go
Add GetJobByNamePattern(test Test, namespace string, pattern string) (*batchv1.Job, error) — marks helper, lists BatchV1 Jobs via client, returns first Job whose name contains the pattern or (nil, nil) if none; handles list errors.
Test Support: JobSet helper
tests/common/support/jobset.go
Add GetSingleJobSet(test Test, namespace string) (*jobsetv1alpha2.JobSet, error) — marks helper, lists JobSets in namespace via JobSet client, asserts exactly one exists, returns it or error.
Test Client: JobSet integration
tests/common/support/client.go
Add JobSet() jobsetclient.Interface to Client and testClient; add jobset field; initialize JobSet client in newTestClient via jobsetclient.NewForConfig(cfg) and return via accessor; add import sigs.k8s.io/jobset/client-go/clientset/versioned.
Trainer tests: JobSet workflows
tests/trainer/jobset_workflow_test.go
Add tests TestJobSetWorkflow and TestFailedJobSetWorkflow and helpers createTrainingRuntimeWithInitializers, createTrainJobWithInitializers, createTrainJobWithFailingInitializer, deleteTrainingRuntime; tests create namespace/PVC, create TrainingRuntime (JobSet with 3 replicatedJobs), create TrainJob(s), assert JobSet contents and TrainJob success/failure, and clean up.

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant K8s as Kubernetes API
    participant Client as Test Client
    participant JobSetCtrl as JobSet Controller
    participant Jobs as Replicated Jobs

    rect rgb(240,248,255)
    Note over Test,K8s: Setup resources
    Test->>K8s: create Namespace & PVC
    Test->>K8s: create TrainingRuntime (contains JobSet spec)
    end

    rect rgb(245,255,240)
    Note over Test,JobSetCtrl: Execution
    Test->>K8s: create TrainJob referencing TrainingRuntime
    JobSetCtrl->>Jobs: instantiate replicatedJobs (dataset/model/node)
    Jobs-->>JobSetCtrl: report status (succeeded / failed)
    end

    rect rgb(255,250,240)
    Note over Test,Client: Verification & cleanup
    Test->>Client: JobSet().List / GetSingleJobSet(namespace)
    Test->>Client: GetJobByNamePattern(namespace, pattern)
    Test->>K8s: assert TrainJob status (succeeded / failed)
    Test->>K8s: delete TrainingRuntime
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect typed JobSet return and imports in tests/common/support/jobset.go.
  • Verify JobSet client initialization and error handling in tests/common/support/client.go.
  • Review test timing/polling and cleanup in tests/trainer/jobset_workflow_test.go for flakiness.

Poem

🐰 I hop where JobSets dance in three,
I sniff for names and fetch with glee,
PVCs shared, initializers hum,
Some jobs finish, some come undone,
I wiggle whiskers — tests run free.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and directly reflects the primary change: adding test coverage for JobSet Workflow Orchestration, which aligns with the three new test files and helper functions added.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/common/support/jobs.go (1)

26-41: Tighten not‑found semantics and avoid pointer-to-loop-variable confusion

Two small improvements you might consider:

  1. Returning nil, nil for “not found” is a bit ambiguous; a signature like (*batchv1.Job, bool, error) (or documenting the nil,nil case clearly) would make call sites more robust.
  2. For clarity, prefer taking the address of the slice element instead of the loop variable, e.g.:
-func GetJobByNamePattern(test Test, namespace, pattern string) (*batchv1.Job, error) {
+func GetJobByNamePattern(test Test, namespace, pattern string) (*batchv1.Job, error) {
 	test.T().Helper()

 	jobs, err := test.Client().Core().BatchV1().Jobs(namespace).List(test.Ctx(), metav1.ListOptions{})
 	if err != nil {
 		return nil, err
 	}

-	for _, job := range jobs.Items {
-		if strings.Contains(job.Name, pattern) {
-			return &job, nil
-		}
-	}
+	for i := range jobs.Items {
+		if strings.Contains(jobs.Items[i].Name, pattern) {
+			return &jobs.Items[i], nil
+		}
+	}

 	return nil, nil
 }

This avoids the common “pointer to range variable” gotcha and makes the intent explicit.

tests/trainer/jobset_workflow_test.go (1)

131-357: TrainingRuntime / JobSet spec is consistent; consider small extraction to reduce duplication

The createTrainingRuntimeWithInitializers helper wires three ReplicatedJobs with the expected dependencies and shared workspace PVC, and the resource requests/limits look reasonable for tests. The structure is clear and matches the intended workflow.

If you ever find this growing or needing reuse, you could optionally extract the common Volume/VolumeMount and BackoffLimit/RestartPolicyNever pieces into small local helpers to reduce repetition, but it’s fine as-is for readability.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54395f4 and e5e350d.

📒 Files selected for processing (3)
  • tests/common/support/jobs.go (1 hunks)
  • tests/common/support/jobset.go (1 hunks)
  • tests/trainer/jobset_workflow_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/common/support/jobset.go (2)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/batch.go (1)
  • Job (27-33)
tests/common/support/client.go (1)
  • Client (39-50)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
  • T (90-102)
  • With (61-63)
  • Test (34-45)
tests/common/support/core.go (2)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/common/support/jobset.go (1)
  • GetSingleJobSet (38-49)
tests/common/support/support.go (3)
  • TestTimeoutShort (33-33)
  • TestTimeoutLong (35-35)
  • TestTimeoutMedium (34-34)
tests/common/support/utils.go (2)
  • Log (34-34)
  • Ptr (27-29)
tests/common/support/trainjob.go (4)
  • TrainJob (26-32)
  • TrainJobConditionComplete (42-44)
  • TrainJobConditionFailed (46-48)
  • TrainJobs (34-40)
tests/kfto/environment.go (2)
  • GetAlpacaDatasetImage (36-38)
  • GetBloomModelImage (32-34)
tests/common/support/environment.go (1)
  • GetTrainingCudaPyTorch28Image (103-105)
tests/common/support/client.go (1)
  • Client (39-50)
🔇 Additional comments (4)
tests/common/support/jobset.go (1)

27-48: JobSet helper is clear and idiomatic for test usage

The jobsetOperatorGVR definition and GetSingleJobSet helper are straightforward and fit well with the existing Test/Gomega patterns. Asserting a single JobSet via HaveLen(1) before returning the first item is reasonable in this test-only context.

tests/trainer/jobset_workflow_test.go (3)

80-129: Failed workflow test logic looks solid and exercises JobSet status conditions

The TestFailedJobSetWorkflow path correctly:

  • Waits for the JobSet to exist via GetSingleJobSet.
  • Polls status.conditions with unstructured.NestedSlice and explicitly errors when the field is missing.
  • Locates a "Failed" condition with status == "True", captures the message, and asserts the TrainJob eventually transitions to TrainJobFailed.

This gives good coverage of the failure behavior and ensures the TrainJob reflects JobSet failure as expected.


359-519: TrainJob helpers align well with initializer semantics

Both createTrainJobWithInitializers and createTrainJobWithFailingInitializer:

  • Correctly reference the TrainingRuntime via RuntimeRef.
  • Configure dataset/model initializer env vars that match what the initializer Jobs expect.
  • Use clear shell commands to validate dataset/model visibility (success case) or ensure the trainer won’t run if the initializer fails (failure case).

These helpers nicely encapsulate the TrainJob setup for the two scenarios and keep the tests themselves concise.


521-534: Runtime cleanup helper is safe and test-friendly

deleteTrainingRuntime wraps the delete call and logs a warning instead of failing the test on cleanup errors, which is a good pattern for integration tests where teardown shouldn’t mask primary failures.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5e350d and 8609f3a.

📒 Files selected for processing (3)
  • tests/common/support/jobs.go (1 hunks)
  • tests/common/support/jobset.go (1 hunks)
  • tests/trainer/jobset_workflow_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/common/support/jobset.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/trainer/jobset_workflow_test.go (10)
tests/common/support/test.go (3)
  • T (90-102)
  • With (61-63)
  • Test (34-45)
tests/common/test_tag.go (1)
  • Tags (32-40)
tests/common/support/core.go (2)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/common/support/jobset.go (1)
  • GetSingleJobSet (38-49)
tests/common/support/support.go (3)
  • TestTimeoutShort (33-33)
  • TestTimeoutLong (35-35)
  • TestTimeoutMedium (34-34)
tests/common/support/utils.go (2)
  • Log (34-34)
  • Ptr (27-29)
tests/common/support/trainjob.go (4)
  • TrainJob (26-32)
  • TrainJobConditionComplete (42-44)
  • TrainJobConditionFailed (46-48)
  • TrainJobs (34-40)
tests/kfto/environment.go (2)
  • GetAlpacaDatasetImage (36-38)
  • GetBloomModelImage (32-34)
tests/common/support/environment.go (1)
  • GetTrainingCudaPyTorch28Image (103-105)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/batch.go (1)
  • Job (27-33)
tests/common/support/client.go (1)
  • Client (39-50)
🔇 Additional comments (1)
tests/trainer/jobset_workflow_test.go (1)

56-74: LGTM: Eventually block correctly handles missing field.

The code properly handles the case when spec.replicatedJobs is not found by returning an error (lines 67-68), ensuring that Eventually keeps polling until the field exists. This addresses the concern raised in the previous review.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8609f3a and fc6e728.

📒 Files selected for processing (3)
  • tests/common/support/jobs.go (1 hunks)
  • tests/common/support/jobset.go (1 hunks)
  • tests/trainer/jobset_workflow_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/common/support/jobset.go (2)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/client.go (1)
  • Client (39-50)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/batch.go (1)
  • Job (27-33)
tests/common/support/client.go (1)
  • Client (39-50)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
  • T (90-102)
  • With (61-63)
  • Test (34-45)
tests/common/support/core.go (2)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/common/support/jobset.go (1)
  • GetSingleJobSet (38-49)
tests/common/support/support.go (3)
  • TestTimeoutShort (33-33)
  • TestTimeoutLong (35-35)
  • TestTimeoutMedium (34-34)
tests/common/support/utils.go (2)
  • Log (34-34)
  • Ptr (27-29)
tests/common/support/trainjob.go (4)
  • TrainJob (26-32)
  • TrainJobConditionComplete (42-44)
  • TrainJobConditionFailed (46-48)
  • TrainJobs (34-40)
tests/kfto/environment.go (2)
  • GetAlpacaDatasetImage (36-38)
  • GetBloomModelImage (32-34)
tests/common/support/environment.go (1)
  • GetTrainingCudaPyTorch28Image (103-105)
tests/common/support/client.go (1)
  • Client (39-50)
🔇 Additional comments (5)
tests/common/support/jobset.go (1)

27-49: LGTM! Clean test helper for JobSet retrieval.

The implementation correctly uses the dynamic client with the JobSet v1alpha2 API. Error handling is appropriate: the function returns errors from the List operation and uses a gomega assertion to verify exactly one JobSet exists, which is the expected behavior for a test helper.

tests/common/support/jobs.go (1)

26-42: LGTM! Loop variable issue correctly resolved.

The past critical issue regarding taking the address of a loop variable has been properly fixed. The code now iterates by index and returns &jobs.Items[i], avoiding the pointer-to-reused-variable bug.

One minor observation: the function returns (nil, nil) when no job matches the pattern. While this may be intentional for your test usage, consider documenting this behavior in a comment if it's not immediately obvious to future readers.

tests/trainer/jobset_workflow_test.go (3)

56-74: Well done fixing the NestedSlice polling logic.

The past issue where Eventually could succeed when spec.replicatedJobs was missing has been correctly addressed. The code now properly separates the error check from the found check (lines 64-69), ensuring the polling loop continues until the field actually exists.


102-143: Excellent error handling improvements in the failure detection logic.

The past issues with unchecked type assertions and ignored NestedString return values have been thoroughly addressed:

  1. Line 118: Type assertion now uses the comma-ok form with proper fallback
  2. Lines 122-125, 127-130, 132-138: All NestedString calls now check both found and err before using values

The current implementation is robust and handles missing or malformed fields gracefully.


151-554: Well-designed test helpers and workflow structure.

The helper functions create a realistic multi-stage JobSet workflow:

  • Dataset initializer → Model initializer → Trainer node, with proper DependsOn chains (lines 237-241, 307-311)
  • BackoffLimit: 0 throughout prevents retry noise in test runs
  • Shell scripts include appropriate validation (e.g., FAIL_ON_PURPOSE check at lines 190-193)
  • Cleanup helper logs warnings rather than failing tests (lines 549-553)

The test coverage effectively validates both success and failure paths for the JobSet orchestration.

@abhijeet-dhumal
Copy link
Contributor

@ChughShilpa These JobSet based sanity tests looks awesome 🚀
Just adding some nit picks ..

Model: &trainerv1alpha1.ModelInitializer{
Env: []corev1.EnvVar{
{
Name: "MODEL_NAME",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick - Model name is not needed as model is hardcoded.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/trainer/jobset_workflow_test.go (2)

174-179: Consider adding resource limits to initializers.

Based on a previous review comment, resource limits should be added to the initializer containers for consistency with the node container (lines 298-307) and to prevent unbounded resource consumption.

Apply this diff to add limits to the dataset-initializer:

 Resources: corev1.ResourceRequirements{
 	Requests: corev1.ResourceList{
 		corev1.ResourceCPU:    resource.MustParse("100m"),
 		corev1.ResourceMemory: resource.MustParse("128Mi"),
 	},
+	Limits: corev1.ResourceList{
+		corev1.ResourceCPU:    resource.MustParse("200m"),
+		corev1.ResourceMemory: resource.MustParse("256Mi"),
+	},
 },

Apply a similar change to the model-initializer resources at lines 244-249.

Based on past review comments.

Also applies to: 244-249


376-378: Optional: MODEL_NAME env var is not functionally used.

As noted in a previous review comment, the MODEL_NAME environment variable is only used for logging at line 234, while the actual copy operation at line 238 hardcodes the path /models/bloom-560m. You could either remove the env var or update the script to use it.

Example to make the script use the variable:

 echo "Model: ${MODEL_NAME}"
 echo "Target path: ${MODEL_PATH}"
 echo "Copying Bloom-560M model to shared volume ..."
 mkdir -p ${MODEL_PATH}
-cp -r /models/bloom-560m ${MODEL_PATH}/
+cp -r /models/${MODEL_NAME} ${MODEL_PATH}/
 ls -la ${MODEL_PATH}/

Based on past review comments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc6e728 and 9de4c34.

📒 Files selected for processing (4)
  • tests/common/support/client.go (6 hunks)
  • tests/common/support/jobs.go (1 hunks)
  • tests/common/support/jobset.go (1 hunks)
  • tests/trainer/jobset_workflow_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/common/support/jobset.go
🧰 Additional context used
🧬 Code graph analysis (2)
tests/common/support/jobs.go (3)
tests/common/support/test.go (2)
  • Test (34-45)
  • T (90-102)
tests/common/support/batch.go (1)
  • Job (27-33)
tests/common/support/client.go (1)
  • Client (40-52)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
  • T (90-102)
  • With (61-63)
  • Test (34-45)
tests/common/support/core.go (2)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/common/support/jobset.go (1)
  • GetSingleJobSet (27-38)
tests/common/support/support.go (3)
  • TestTimeoutShort (33-33)
  • TestTimeoutLong (35-35)
  • TestTimeoutMedium (34-34)
tests/common/support/utils.go (2)
  • Log (34-34)
  • Ptr (27-29)
tests/common/support/trainjob.go (4)
  • TrainJob (26-32)
  • TrainJobConditionComplete (42-44)
  • TrainJobConditionFailed (46-48)
  • TrainJobs (34-40)
tests/kfto/environment.go (2)
  • GetAlpacaDatasetImage (36-38)
  • GetBloomModelImage (32-34)
tests/common/support/environment.go (1)
  • GetTrainingCudaPyTorch28Image (103-105)
tests/common/support/client.go (1)
  • Client (40-52)
🔇 Additional comments (5)
tests/common/support/client.go (1)

30-30: LGTM! Clean JobSet client integration.

The JobSet client is properly integrated following the same pattern as other clients in the test framework. The implementation includes proper error handling and consistent naming conventions.

Also applies to: 44-44, 58-58, 82-84, 141-144, 185-185

tests/common/support/jobs.go (1)

26-42: LGTM! Helper correctly implemented.

The function properly handles the loop variable pointer issue (previously flagged) by iterating with an index and returning a pointer to the slice element. The error handling and return patterns are appropriate.

tests/trainer/jobset_workflow_test.go (3)

37-73: LGTM! Well-structured end-to-end test.

The happy-path test properly validates the JobSet workflow with initializers. The use of typed access (via GetSingleJobSet) and proper cleanup with defer statements are good practices.


75-118: LGTM! Failure path test correctly validates error propagation.

The test properly verifies that initializer failures propagate to the JobSet and TrainJob. The use of typed conditions is correct and avoids the unstructured field access issues that were previously flagged.


348-443: LGTM! Test helpers are well-implemented.

The helper functions follow consistent patterns:

  • Proper use of test.T().Helper() for stack traces
  • Appropriate error handling and logging
  • The failing initializer correctly uses FAIL_ON_PURPOSE=true to trigger the failure path in the runtime script

Also applies to: 445-508, 510-523

defer deleteTrainJob(test, namespace, trainJob.Name)

// Verify JobSet is created with 3 replicated jobs
test.Eventually(func() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you use the JobSet client then I think this check can be made simpler 1-2 liner, using WithTransform to get replicatedJobs

}
return nil
}, TestTimeoutShort).Should(Succeed())
test.T().Log("JobSet is created with 3 replicated jobs (dataset-initializer, model-initializer, node)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add here assertions verifying that dataset-initializer job is created (as the only job for now) and finished, after that model-initializer is created and finished and then node job is created and finished?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be needed to introduce some explicit sleep commands (a few seconds) in jobs to make sure that the assertions manage to verify it before job finishes.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/trainer/jobset_workflow_test.go (1)

37-73: Good test coverage for the success path.

The test properly validates the JobSet workflow with initializers, including namespace setup, resource creation, and verification of successful completion.

However, there are two optional improvements you could consider:

  1. Simplify the JobSet verification (lines 56-66): As sutaakar suggested, you could use WithTransform to make this check more concise:
test.Eventually(func(g Gomega) {
    jobset, err := GetSingleJobSet(test, namespace)
    g.Expect(err).NotTo(HaveOccurred())
    g.Expect(jobset.Spec.ReplicatedJobs).To(HaveLen(3))
}, TestTimeoutShort).Should(Succeed())
  1. Add sequential execution verification (per sutaakar's comment at line 67): You could add assertions to verify that jobs execute in the expected order (dataset-initializer → model-initializer → node), though this would require polling individual Job statuses and might need sleep commands in the jobs to make timing predictable.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9de4c34 and c91c3e4.

📒 Files selected for processing (1)
  • tests/trainer/jobset_workflow_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/trainer/jobset_workflow_test.go (9)
tests/common/support/test.go (3)
  • T (90-102)
  • With (61-63)
  • Test (34-45)
tests/common/support/core.go (2)
  • CreatePersistentVolumeClaim (242-276)
  • AccessModes (235-240)
tests/common/support/jobset.go (1)
  • GetSingleJobSet (27-38)
tests/common/support/support.go (3)
  • TestTimeoutShort (33-33)
  • TestTimeoutLong (35-35)
  • TestTimeoutMedium (34-34)
tests/common/support/utils.go (2)
  • Log (34-34)
  • Ptr (27-29)
tests/common/support/trainjob.go (4)
  • TrainJob (26-32)
  • TrainJobConditionComplete (42-44)
  • TrainJobConditionFailed (46-48)
  • TrainJobs (34-40)
tests/kfto/environment.go (2)
  • GetAlpacaDatasetImage (36-38)
  • GetBloomModelImage (32-34)
tests/common/support/environment.go (1)
  • GetTrainingCudaPyTorch28Image (103-105)
tests/common/support/client.go (1)
  • Client (40-52)
🔇 Additional comments (5)
tests/trainer/jobset_workflow_test.go (5)

75-118: Excellent failure path coverage.

The test properly validates that failures in initializer jobs propagate correctly through the JobSet to the TrainJob status. The condition checking logic is robust and handles missing fields appropriately.


120-354: Well-structured TrainingRuntime with proper resource management.

The helper creates a comprehensive TrainingRuntime with three replicated jobs in the correct dependency order. All containers have both resource requests and limits specified, which is essential for proper scheduling and resource management in Kubernetes.

The shell scripts are clear and handle error conditions appropriately (e.g., the FAIL_ON_PURPOSE flag at lines 159-162).


356-451: TrainJob helper correctly implements the initializer pattern.

The function properly configures both dataset and model initializers with environment variables, and the trainer command validates that the initialized assets are accessible.

Regarding sutaakar's comment about MODEL_NAME at line 385: While the image is hardcoded to bloom-560m, the MODEL_NAME environment variable is functionally necessary because the model-initializer script (line 242 in the TrainingRuntime) uses ${MODEL_NAME} to locate the model path: cp -r /models/${MODEL_NAME} ${MODEL_PATH}/.

This design follows the initializer pattern where the TrainingRuntime provides generic, reusable scripts and the TrainJob supplies the configuration parameters. It makes the runtime template more flexible for future use cases.


453-516: Failure test helper correctly configured.

The function properly sets up a TrainJob that will fail by injecting the FAIL_ON_PURPOSE=true environment variable, which triggers the error condition in the dataset-initializer script. The minimal trainer command is appropriate since the job should fail before reaching that stage.


518-531: LGTM!

The cleanup helper follows best practices with proper use of test.T().Helper() and appropriate logging for both success and failure cases. Using log warnings rather than assertions for cleanup failures is correct for deferred cleanup operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants